arXiv: 1505.00553v2 [stat.ML] 1 Dec 2016 




On Regret-Optimal Learning in Decentralized 
Multi-player Multi-armed Bandits 

Naumaan Nayyar, Dileep Kalathil and Rahul Jain 


Abstract —We consider the problem of learning in single-player
and multiplayer multi-armed bandit models. Bandit problems
are classes of online learning problems that capture exploration
versus exploitation tradeoffs. In a multi-armed bandit model,
players can pick among many arms, and each play of an arm
generates an i.i.d. reward from an unknown distribution. The
objective is to design a policy that maximizes the expected reward
over a time horizon in the single-player setting, and the sum of
expected rewards in the multiplayer setting. In the multiplayer
setting, arms may give different rewards to different players.
There is no separate channel for coordination among the players.
Any attempt at communication is costly and adds to regret. We
propose two decentralizable policies, $E^3$ (E-cubed) and $E^3$-TS, that
can be used in both single-player and multiplayer settings. These
policies are shown to yield expected regret that grows at most as
$O(\log^{1+\delta} T)$ (and $O(\log T)$ under some assumption). It is well
known that $O(\log T)$ is the lower bound on the rate of growth
of regret even in a centralized case. The proposed algorithms
improve on prior work where regret grew at $O(\log^2 T)$. More
fundamentally, these policies address the question of the additional
cost incurred in decentralized online learning, suggesting that
there is at most a $\delta$-factor cost in terms of order of regret. This
solves a problem that is relevant in many domains and had been
open for a while.

Index Terms —Learning algorithms; Decision making; Decentralized
systems; Optimization problems; Cognitive systems.


I. Introduction 

Multi-armed bandit (MAB) models represent an exploration 
versus exploitation trade-off where the player must choose 
between exploring the environment to find better options, and 
exploiting based on her current knowledge to maximize her 
utility. These models apply to many domains, such as
display advertisements, sensor networks, route planning
and spectrum sharing. The model can be understood through 
a simple game of choosing between two coins with unknown 
biases. The coins are tossed repeatedly and one of them is 
chosen at each instant. If, at a given instant, the chosen coin
turns up heads, we get a reward of $1, otherwise we get no 
reward. It is known that one of the two coins has a better bias, 
but the identity of the coin is not known. The question is:
what is the optimal ‘learning’ policy that maximizes the
expected reward, i.e., one that discovers which coin has the better
bias while also maximizing the cumulative reward as the
game is played? Note that the player neither knows the values of
the biases nor has a prior probability distribution
on these values. This motivates the non-Bayesian setting. The
formulation where the player has a prior distribution on the 
parameters is called Bayesian multi-armed bandits. 

The idea of multi-armed bandit models dates back to
Thompson [1], and the first rigorous formulation is due to
Robbins [2]. The single player multi-armed bandit problem
in a non-Bayesian setting was first formulated by Lai and
Robbins [3]. Regret formally measures the performance of a
strategy against the best policy that could be employed if the
distribution parameters were known. A bandit policy whose
regret grows more slowly than the horizon, so that the fraction
of suboptimal choices vanishes, is said to have sub-linear
regret. It was shown in [3] that
there is no learning policy that asymptotically has expected
regret growing slower than $O(\log T)$. A learning scheme was
also constructed that asymptotically achieved this lower bound.

This model was subsequently studied and generalized by 
many researchers. In [4], Anantharam et al. generalized it to
the case of multiple plays, i.e., the player can pick multiple
arms (or coins) when there are more than 2 arms. In [5],
Agrawal proposed a sample mean based index policy that
asymptotically achieved $O(\log T)$ regret. For the special case
of bounded support for rewards, Auer et al. [6] introduced
a simple index-based policy, UCB$_1$, that achieved logarithmic
expected regret over finite time horizons. UCB$_1$ has since
become the benchmark to compare new algorithms against 
because of its power and simplicity. 

Recently, policies based on Thompson Sampling (TS) [1]
have experienced a surge of interest due to their much better
empirical performance [7]. TS is a probability matching policy
which, unlike the UCB class of policies that use a deterministic
confidence bound, draws samples from a distribution to determine
which arm to play based on the probability of its being
optimal. The logarithmic regret performance of the policy was
not proved until very recently [8]. The Bayes-UCB algorithm,
introduced in [9], also uses a Bayesian approach for
analyzing the regret bound for stochastic bandit problems.

Deterministic sequencing algorithms which have separate 
exploration and exploitation phases have also appeared in 
the literature as an alternative to the joint exploration and 
exploitation approaches of UCB-like and probability matching 
algorithms. Noteworthy among these is the Phased Exploration
and Greedy Exploitation policy for linear bandits m
that achieves $O(\sqrt{T})$ regret in general and $O(\log T)$ regret
for finitely many linearly parametrized arms. Other noteworthy
algorithms include the logarithmic-regret deterministic
sequencing of exploration and exploitation policy, DU
with i.i.d. setting and Da with Markovian setting. Single-player
bandit problems have also been looked at in the PAC
framework, for instance, in m. However, we restrict our
attention to performance in the expected sense in this work. 

In addition to single player bandits, there has been growing
interest in multiplayer learning in multi-armed bandits,
motivated by distributed sensor networks, wireless spectrum 
sharing and in particular cognitive radio networks. Suppose 
there are two wireless users trying to choose between two 
wireless channels. Each wireless channel is random, and looks 
different to each user. If channel statistics were known, we 
would try to determine a matching wherein the expected sum- 
rate of the two users is maximized. But the channel statistics 
are unknown, and they must be learnt by sampling the channels.
Moreover, the two users have to do this independently
and cannot share their observations as there is no dedicated
communication channel between them. They may, however,
communicate implicitly for coordination, but this would come
at the expense of reduced opportunities for rewards or benefits,
and thus would add to regret. One can easily imagine a more
general network setting with $M$ users and $N$ channels. This
immediately gives rise to two questions. First, what is the
lower bound for decentralized learning? That is, is there an
inherent cost of decentralization in such a network? And second,
can we design a simple learning algorithm with provably 
optimal performance guarantees, in the context of such a 
decentralized network problem? 

Policies for decentralized learning with sublinear regret have 
appeared in the literature for various models. When arms 
were restricted to have the same rewards for different users, 
Anandkumar et al. m showed that logarithmic regret was
achievable as the problem reduces to a ranking problem that
can be solved in constant time in a decentralized manner.
Similar works have also appeared for i.i.d. ED E9 and
Markovian ini m arm reward settings. Relaxing this assumption
makes the problem more complicated as it now
becomes a bipartite matching problem and no decentralized
algorithm performs quickly enough. In our previous work El,
we proposed a policy, dUCB$_4$, that achieved $O(\log^2 T)$ regret
through a recurrent negotiation mechanism between players.
However, the answers to the two questions above remained
unknown. In a similar work [18], the authors address the problem
of decentralized multi-armed bandits. While they address the
same problem as ours, the emphasis is on the stability of this
decentralized setting with minimum possible communication.
Also, they do not provide any optimality guarantees relative
to the optimal centralized learning problem. Moreover, whereas
our paper assumes that the set of players in the system remains
the same, in [18] users can arrive and leave at random times. Landgren
et al. [19] use a multi-armed bandit model for a cooperative
decision making problem in the context of running a consensus
algorithm. Their setting is very different from the problem
considered in this paper.

In this paper, we do not present an information theoretic
lower bound on decentralized learning in multiplayer
multi-armed bandit problems. Such a result would be very interesting
as it would also yield insight into the exact role of information
sharing between players for a decentralized policy to work
without an increase in expected regret. However, we
partially answer both questions above through two
new decentralizable policies, $E^3$ and $E^3$-TS, where $E^3$ stands
for Exponentially-spaced Exploration and Exploitation policy,
which we also call E-cubed.

Both policies yield expected regret of the order $O(\log^{1+\delta} T)$
($O(\log T)$ under some assumptions) in both single and multiplayer
settings. The policies are based on exploration and
exploitation in pre-determined phases such that over a long
time horizon $T$, there are only logarithmically many slots in
the exploration phases. It is well known that the optimal order
of regret that can be achieved is $O(\log T)$ [3]. These policies
suggest an answer to the fundamental question of the inherent cost
of decentralization: there is no cost to the order optimality, at
least up to a $\log^{\delta} T$ factor. An asymptotic lower bound for the
decentralized MAB problem (similar to that of the centralized
MAB in [3]) is an important future research question.

The policies introduced in this paper, and the corresponding 
results hold even when the rewards are Markovian. However, 
we only present the i.i.d. case here and refer readers to our 
earlier paper El for ideas on extensions to the Markovian
setting. Extensive simulations were conducted to evaluate the
empirical performance of these policies and to compare them with
prior work in the literature, including the classical UCB$_1$ and
TS policies. The decentralized policies $dE^3$ and $dE^3$-TS are
compared with the previously known dUCB$_4$ policy.

The rest of the paper is organized as follows. Section II
describes the model and problem formulations for single and
multiplayer bandits. Section III describes relevant prior work
in the area. The new policies $E^3$ and $E^3$-TS, and their
multiplayer counterparts, $dE^3$ and $dE^3$-TS, are described and studied
in Section IV. Section V presents the empirical performance of
new and previous policies.

II. Model and problem formulation 

In this section, we describe problem formulations for single 
and multiplayer bandits. The single player formulation has 
been well-studied in the literature, for instance, by Auer et al.
[6] and others. The multiplayer formulation is much newer,
and has appeared in our previous work El.

A. Single player model 

We consider an $N$-armed bandit problem. At each instant $t$,
an arm $k$ is chosen, and a reward $X_k(t)$ is generated from an
independent and identically distributed (i.i.d.) random process
with a fixed but unknown distribution. The processes are
assumed to have bounded support, without loss of generality,
in $[0,1]$. Arm reward distributions have means $\mu_k$ that are
unknown. When choosing an arm, the player has access to
the history of rewards and actions, $\mathcal{H}(t)$, with $\mathcal{H}(0) := \emptyset$.
Denote the arm chosen at time $t$ by $a(t) \in \mathcal{A} := \{1, \ldots, N\}$.
A policy $\alpha$ is a sequence of maps $\alpha(t) : \mathcal{H}(t) \to \mathcal{A}$ that
specifies the arm chosen at time $t$. The player’s objective is
to choose a policy that maximizes the expected reward over a
finite time horizon $T$.

If the mean rewards of the arms were known, the problem
would be trivially solved by always playing the arm with the highest
mean reward, i.e., $a(t) = \arg\max_{1 \le j \le N} \mu_j$, $\forall t$. When the
mean rewards are not known, the notion of regret is used
to compare policies. Regret is the difference between the
cumulative reward obtained by playing the most rewarding arm
all the time and that obtained by a policy $\alpha$. Formally, the player’s
objective is to minimize the expected regret over all causal
policies $\alpha$ as defined above, which is given by

$$\mathcal{R}_\alpha(T) = T\mu_1 - \mathbb{E}_\alpha\left[\sum_{t=1}^{T} X_{a(t)}(t)\right], \tag{1}$$


where arm 1 is taken to have the greatest mean w.l.o.g. 
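
To make definition (1) concrete, the following Python sketch estimates the regret of a policy on simulated arms. It is an illustration only: the function name `pseudo_regret`, the Bernoulli reward model, and the two constant policies are assumptions made for the example. The code accumulates the gap between the best mean and the mean of the arm played, which has the same expectation as (1).

```python
import random

def pseudo_regret(means, policy, horizon, seed=0):
    """Accumulate (best mean - mean of played arm) over `horizon` plays.

    `policy` maps the history, a list of (arm, reward) pairs, to an arm index.
    Rewards are Bernoulli with the given means (an assumption of this sketch).
    """
    rng = random.Random(seed)
    best = max(means)
    history = []
    regret = 0.0
    for _ in range(horizon):
        arm = policy(history)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        history.append((arm, reward))
        regret += best - means[arm]
    return regret

# Two hypothetical policies that ignore the history.
always_first = lambda history: 0
always_second = lambda history: 1
```

Playing the best arm throughout gives zero regret, while a policy that always plays a suboptimal arm accumulates regret linearly in the horizon.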

In practical implementations of bandit algorithms in low-power
settings such as sensor networks, where any
learning/control policy should consume a minimal
amount of energy, it is useful to include a computation
cost as well. This is particularly the case when the algorithms
must solve combinatorial optimization problems that are NP- 
hard. Such costs arise in decentralized settings in particular, 
where algorithms pay a communication cost for coordination 
between the decentralized players. For example, as we shall 
see later in our decentralized learning algorithm, the players 
may have to spend many time slots for coming up with a 
bipartite matching. We model it as a constant C units of 
cost each time an index is computed by the policy. With this 
refinement, the regret of a policy $\alpha$ that computes its indices
$m(T)$ times over a time horizon $T$ is

$$\tilde{\mathcal{R}}_\alpha(T) := T\mu_1 - \sum_{j=1}^{N} \mu_j \mathbb{E}_\alpha[n_j(T)] + C\,\mathbb{E}_\alpha[m(T)], \tag{2}$$

where $n_j(T)$ is the number of times arm $j$ is played.


B. Multiplayer model 

We now describe the generalization of the single player model,
where we consider an $N$-armed bandit with $M$ players. We
will refer to arms and channels interchangeably. There is no
dedicated communication channel for coordination among the
players. However, we do allow players to communicate with
one another by playing arms in a certain way, e.g., playing arm 1
signals a bit ‘0’ and playing arm 2 signals a bit ‘1’. This of course
will add to regret, and hence such communication comes at a
cost. We assume that $N \ge M$.

At any instant $t$, each player chooses one arm from the set of
$N$ arms or takes no action (i.e., selects no arm). If more than
one player picks the same arm, we regard it as a collision and
this interference results in zero reward for those players. The
rest of the model is similar to the single player case. Arm $k$
chosen by player $i$ generates an i.i.d. reward $S_{i,k}(t)$ from an
unknown distribution, which has bounded support, w.l.o.g., in
$[0,1]$. Let $\mu_{i,k}$ denote the unknown mean of $S_{i,k}(t)$.

Let $X_{i,k}(t)$ be the reward that player $i$ gets from playing
arm $k$ at time $t$. Thus, if there is no collision, $X_{i,k}(t) = S_{i,k}(t)$.
Denote the action of player $i$ at time $t$ by
$a_i(t) \in \mathcal{A} := \{1, \ldots, N\}$. Let $Y_i(t)$ be the communication
message from player $i$ at time $t$ and $Y_{-i}(t)$ be the
messages from all the other players except player $i$ at time $t$.
Then, the history seen by player $i$ at time $t$ is
$\mathcal{H}_i(t) = \{(a_i(1), X_{i,a_i(1)}(1), Y_{-i}(1)), \ldots, (a_i(t-1), X_{i,a_i(t-1)}(t-1), Y_{-i}(t-1))\}$
with $\mathcal{H}_i(0) = \emptyset$. A policy $\alpha_i = (\alpha_i(t))_{t=1}^{\infty}$
for player $i$ is a sequence of maps $\alpha_i(t) : \mathcal{H}_i(t) \to \mathcal{A}$ that
specifies the arm to be played at time $t$.

The players have a team objective: they want to maximize
the expected sum of rewards $\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,a_i(t)}(t)\right]$ over
some time horizon $T$. Let $\mathcal{P}(N)$ denote the set of possible
permutations of the $N$ arms. If the $\mu_{i,j}$ were known, the optimal
policy would clearly be to pick the optimal bipartite matching between
arms and players (which may not be unique),

$$k^{**} \in \arg\max_{k \in \mathcal{P}(N)} \sum_{i=1}^{M} \mu_{i,k_i}. \tag{3}$$
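
As a small illustration of the optimal matching in (3), the Python sketch below finds it by brute force when the means are known. The function name `optimal_matching` and the example matrix are hypothetical, and exhaustive enumeration is only practical for small $M$ and $N$; it is not the distributed procedure used by the policies in this paper.

```python
from itertools import permutations

def optimal_matching(mu):
    """Return (assignment, value) maximizing sum_i mu[i][assignment[i]].

    mu[i][j] is the mean reward of player i on arm j, with M <= N.
    Every candidate gives the M players distinct arms, so there are no collisions.
    """
    num_players, num_arms = len(mu), len(mu[0])
    best_value, best_assignment = float("-inf"), None
    for assignment in permutations(range(num_arms), num_players):
        value = sum(mu[i][assignment[i]] for i in range(num_players))
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_assignment, best_value

# Example: both players prefer arm 0, but the team does better
# when player 0 takes arm 1 and player 1 takes arm 0.
example_mu = [[0.9, 0.8], [0.85, 0.1]]
```

The example shows why the multiplayer problem is a matching problem and not a ranking problem: each player greedily taking its own best arm would cause a collision on arm 0.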


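When expected rewards are not known, players must pick
learning policies that minimize the expected regret, defined for
policies $\alpha = (\alpha_i, 1 \le i \le M)$ as

$$\mathcal{R}_\alpha(T) = T \sum_{i=1}^{M} \mu_{i,k^{**}_i} - \mathbb{E}_\alpha\left[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,a_i(t)}(t)\right]. \tag{4}$$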


As in the single player model, we consider a refinement of 
the regret to factor in computational or communication costs. 
Communication costs are justified because known distributed 
algorithms for bipartite matching ED, E2 require a certain 
amount of information exchange over multiple time slots. This 
cost will depend on the specific algorithm. Here, however, we 
will just consider an ‘abstract’ cost C. 

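Let $C$ units of cost be incurred each time the distributed bipartite
matching is run, and let $m(t)$ be the number of times this happens up
to time $t$. Then, the expected regret for policy $\alpha$ to be minimized is

$$\tilde{\mathcal{R}}_\alpha(T) = T \sum_{i=1}^{M} \mu_{i,k^{**}_i} - \mathbb{E}_\alpha\left[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,a_i(t)}(t)\right] + C\,\mathbb{E}_\alpha[m(T)], \tag{5}$$

where $k^{**}$ is the optimal matching as defined in (3).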


III. Prior work 

We now briefly describe the key features and results of 
existing single and multiplayer bandit policies. 


A. Single player policies 

We focus on three different MAB algorithms that capture 
different classes of policies. 

In [6], Auer et al. proposed an index-based policy, UCB$_1$,
which achieves logarithmic regret. It works by playing the
arm with the largest value of sample mean plus a confidence
bound. The confidence interval shrinks deterministically as the arm gets
played more often, which trades off exploration and exploitation.
It was shown in [6] that the expected regret incurred by the
policy over a horizon $T$ is bounded by

$$\mathcal{R}_{\mathrm{UCB}_1}(T) \le 8\log(T) \sum_{j>1}^{N} \frac{1}{\Delta_j} + \left(1 + \frac{\pi^2}{3}\right) \sum_{j>1}^{N} \Delta_j, \tag{6}$$

where $\Delta_j := \mu_1 - \mu_j$.
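
The index rule described above can be sketched in Python as follows. This is a minimal illustration that assumes Bernoulli rewards and uses the standard UCB$_1$ index from Auer et al., the sample mean plus $\sqrt{2 \ln n / n_j}$, where $n$ is the number of plays so far and $n_j$ the number of plays of arm $j$. The function name `ucb1` and its parameters are hypothetical.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return the number of plays per arm."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms     # n_j: plays of arm j so far
    totals = [0.0] * n_arms   # sum of rewards observed on arm j
    for t in range(horizon):  # t is the number of plays made so far
        if t < n_arms:
            arm = t           # initialisation: play every arm once
        else:
            # Sample mean plus a confidence term that shrinks as counts[j] grows.
            arm = max(
                range(n_arms),
                key=lambda j: totals[j] / counts[j]
                + math.sqrt(2.0 * math.log(t) / counts[j]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts
```

Note that the index is recomputed at every time step, which is the property that makes the computation cost of this policy grow linearly with the horizon.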

Thompson Sampling (TS) is a probability-matching policy
that has been around for quite some time in the literature [1],
although it was not well studied in the context of bandit
problems until quite recently [7], [8]. Arms are played randomly
according to the probability of their being optimal.
As an arm gets played more often, its sampling distribution
becomes narrower. Unlike a fully Bayesian method such as the
Gittins Index [23], TS can be implemented efficiently in bandit
problems. The regret of the policy was shown in [8] to be
bounded by

$$\mathcal{R}_{\mathrm{TS}}(T) \le O\left(\left(\sum_{j>1} \frac{1}{\Delta_j^2}\right)^{2} \log(T)\right), \tag{7}$$

where the constants have been omitted for brevity. A stronger
upper bound for the case of Bernoulli rewards that matches
the asymptotic rate lower bound in Lai and Robbins [3] is
given in [24]. Numerically, TS has been found to
outperform UCB$_1$ in most settings 0 , ( 3 ).
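
For comparison with the index-based rule, here is a minimal Python sketch of Thompson Sampling for Bernoulli arms with a uniform Beta(1, 1) prior. The function name `thompson_sampling` and the reward model are assumptions of the example; the sketch only illustrates the probability-matching idea described above.

```python
import random

def thompson_sampling(means, horizon, seed=0):
    """Run Beta-Bernoulli Thompson Sampling and return plays per arm."""
    rng = random.Random(seed)
    n_arms = len(means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    counts = [0] * n_arms
    for _ in range(horizon):
        # Draw one sample per arm from its current Beta posterior.
        samples = [
            rng.betavariate(successes[j] + 1, failures[j] + 1)
            for j in range(n_arms)
        ]
        arm = max(range(n_arms), key=samples.__getitem__)
        if rng.random() < means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
        counts[arm] += 1
    return counts
```

An arm is played whenever its posterior sample is the largest, so the chance of playing it tracks the posterior probability that it is optimal.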

UCB$_4$ is another confidence-bound based index policy that
was proposed recently to overcome some of the shortcomings
of the UCB$_1$ policy, namely its reliance on index computation in
each time step and the difficulty in extending the algorithm to
a multiplayer setting. It works by cleverly choosing a sequence
of times at which to compute the UCB$_1$ index. The expected regret was
shown in (13 to be bounded by

$$\mathcal{R}_{\mathrm{UCB}_4}(T) \le \Delta_{\max}\left(\sum_{j>1} \frac{12\log(T)}{\Delta_j^2} + 2N\right), \tag{8}$$

where $\Delta_{\max} = \max_j \Delta_j$.

It can be shown that UCB$_1$ and TS incur linear regret if
computation cost is included in the model. The expected regret
of the UCB$_4$ algorithm over a time horizon $T$ with computation
cost $C$ is bounded by [17]

$$\tilde{\mathcal{R}}_{\mathrm{UCB}_4}(T) \le \left(\Delta_{\max} + C(1 + \log(T))\right)\left(\sum_{j>1} \frac{12\log(T)}{\Delta_j^2} + 2N\right). \tag{9}$$

Thus, the expected regret is $O(\log^2(T))$.


with increasing frame length addresses the case when $\Delta_{\min}$ is
unknown ED.

IV. New (near-)logarithmic bandit policies 

In this section, we present our work in developing two 
closely related policies for single player bandit problems and 
their generalizations to multiplayer settings. 

A. Single player policies: $E^3$ and $E^3$-TS

$E^3$ and $E^3$-TS are phased policies detailed in Algorithms 1
and 2, respectively. Their key difference from the previous policies
is that they have deterministic exploration and exploitation
phases. In the following, an epoch is defined to consist of
one exploration phase and one exploitation phase.

Exploration phase: During an exploration phase, the player 
tries out different arms in a round-robin fashion and computes 
indices for each arm. At the end of the phase, the player 
chooses the arm with the maximum value of the index. The 
index computation differs for the $E^3$ and $E^3$-TS policies.

Exploitation phase: In this phase, the player plays the arm 
that was chosen at the end of the previous exploration phase. 
No index computation happens during the exploitation phase 
and the player sticks to her decisions during this phase. The 
length of the exploitation phase doubles each successive epoch. 

Algorithm 1: Exponentially-spaced Exploration and Exploitation policy ($E^3$)

1: Initialization: Set $t = 0$ and $l = 1$;
2: while ($t < T$) do
3:    Exploration Phase: Play each arm $j$, $1 \le j \le N$, $\gamma$ number of times;
4:    Update the sample mean $\bar{X}_j(l)$, $1 \le j \le N$;
5:    Compute the best arm $j^*(l) := \arg\max_{1 \le j \le N} \bar{X}_j(l)$;
6:    Exploitation Phase: Play arm $j^*(l)$ for $2^l$ time slots;
7:    Update $t \leftarrow t + N\gamma + 2^l$, $l \leftarrow l + 1$;
8: end while
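
Algorithm 1 can be simulated with a short Python sketch. The function name `e3`, the Bernoulli reward model, the truncation of the last epoch at the horizon, and the choice to build the sample means from exploration samples only are assumptions made for this illustration; they are not taken from the paper.

```python
import random

def e3(means, horizon, gamma, seed=0):
    """Simulate the E^3 policy; return (plays per arm, number of epochs)."""
    rng = random.Random(seed)
    n_arms = len(means)
    explore_counts = [0] * n_arms    # exploration plays per arm
    explore_totals = [0.0] * n_arms  # exploration rewards per arm
    plays = [0] * n_arms             # all plays per arm
    t, epoch = 0, 1

    def pull(arm):
        return 1.0 if rng.random() < means[arm] else 0.0

    while t < horizon:
        # Exploration phase: gamma plays of every arm, in round-robin order.
        for arm in range(n_arms):
            for _ in range(gamma):
                if t >= horizon:
                    break
                explore_totals[arm] += pull(arm)
                explore_counts[arm] += 1
                plays[arm] += 1
                t += 1
        # One index computation per epoch: the arm with the best sample mean.
        best = max(
            range(n_arms),
            key=lambda j: explore_totals[j] / explore_counts[j]
            if explore_counts[j] else 0.0,
        )
        # Exploitation phase: commit to that arm for 2**epoch slots.
        for _ in range(2 ** epoch):
            if t >= horizon:
                break
            pull(best)
            plays[best] += 1
            t += 1
        epoch += 1
    return plays, epoch - 1
```

Because the exploitation length doubles every epoch, the number of epochs, and hence the number of index computations, grows only logarithmically with the horizon.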


B. Multiplayer policies 

The major issues that are encountered in decentralizing bandit
policies are coordination among players and finite precision
of indices being communicated. The dUCB$_4$ policy m was the
first such policy that did not assume identical channel rewards
for different players. The policy is a natural decentralization
of UCB$_4$ that uses Bertsekas’ auction algorithm [25] for distributed
bipartite matching.

If $\Delta_{\min}$ is known, the expected regret of dUCB$_4$ is

$$\tilde{\mathcal{R}}_{\mathrm{dUCB}_4}(T) \le \left(L\Delta_{\max} + C(L)(1 + \log(T))\right) \left(\frac{4M^3(M+2)N\log(T)}{(\Delta_{\min} - (M+1)\epsilon)^2} + NM(2M+1)\right),$$

where $L$ is the frame length, $\Delta_{\min}$ is the minimum difference
between the optimal and the next best permutations, $\Delta_{\max}$ is
the maximum difference in rewards between permutations, and
$\epsilon$ is the precision input of the distributed bipartite matching
algorithm. Also, $C(L)$ indicates that the cost of communication
and computation is a function of the frame length $L$. Thus,
$\tilde{\mathcal{R}}_{\mathrm{dUCB}_4}(T) = O(\log^2(T))$. A slight modification to the policy


Algorithm 2: Exponentially-spaced Exploration and Exploitation algorithm-TS ($E^3$-TS)

1: Initialization: Set $l = 1$ and $t = 0$. For each arm $i = 1, 2, \ldots, N$, set $S_i = 0$, $F_i = 0$;
2: while ($t < T$) do
3:    Exploration Phase: Play each arm $j$, $1 \le j \le N$, $\gamma$ number of times;
4:    For each play of each arm $i$, store reward as $\tilde{r}_i(t)$;
5:    Perform a Bernoulli trial with success probability $\tilde{r}_i(t)$ and observe output $r_i(t)$;
6:    If $r_i(t) = 1$, then set $S_i = S_i + 1$, else $F_i = F_i + 1$;
7:    Sample $\theta_i(l)$ from Beta($S_i + 1$, $F_i + 1$) distribution;
8:    Compute the best arm $j^*(l) := \arg\max_{1 \le j \le N} \theta_j(l)$;
9:    Exploitation Phase: Play arm $j^*(l)$ for $2^l$ time slots;
10:   Update $t \leftarrow t + N\gamma + 2^l$, $l \leftarrow l + 1$;
11: end while
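
A Python sketch of Algorithm 2 follows. It is a hedged illustration: the function name `e3_ts`, the callback `sample_reward` (which must return a reward in $[0,1]$), and the truncation of the final epoch at the horizon are assumptions of the example.

```python
import random

def e3_ts(sample_reward, n_arms, horizon, gamma, seed=0):
    """Simulate the E^3-TS policy and return the number of plays per arm.

    sample_reward(arm, rng) is a hypothetical callback returning a reward in [0, 1].
    """
    rng = random.Random(seed)
    successes = [0] * n_arms  # S_i
    failures = [0] * n_arms   # F_i
    plays = [0] * n_arms
    t, epoch = 0, 1
    while t < horizon:
        # Exploration phase: gamma plays of every arm.
        for arm in range(n_arms):
            for _ in range(gamma):
                if t >= horizon:
                    break
                reward = sample_reward(arm, rng)
                # Bernoulli trial with success probability equal to the reward.
                if rng.random() < reward:
                    successes[arm] += 1
                else:
                    failures[arm] += 1
                plays[arm] += 1
                t += 1
        # One posterior sample per arm; the largest sample picks the arm.
        theta = [
            rng.betavariate(successes[j] + 1, failures[j] + 1)
            for j in range(n_arms)
        ]
        best = max(range(n_arms), key=theta.__getitem__)
        # Exploitation phase: commit to that arm for 2**epoch slots.
        for _ in range(2 ** epoch):
            if t >= horizon:
                break
            sample_reward(best, rng)
            plays[best] += 1
            t += 1
        epoch += 1
    return plays
```

The Bernoulli trial in the exploration phase is what lets rewards with general support in $[0,1]$ be handled with Beta posteriors.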


$E^3$ and $E^3$-TS, while largely similar, differ in how they
choose the arm to play during the exploitation phase. While
$E^3$ uses the simple sample mean value, $E^3$-TS draws from a
$\beta$-distribution in a manner similar to the TS policy.

The $\beta$-distribution is chosen in $E^3$-TS due to its convenient
posterior form after Bernoulli observations. A $\beta(a, b)$-distribution
prior results in a posterior of $\beta(a+1, b)$ or $\beta(a, b+1)$ depending
on success or failure of the Bernoulli trial, respectively.

We now give the performance bounds for the policies with 
an index computation cost C in the main result of this section. 
Both algorithms will be analyzed concurrently as their proof 
techniques are largely similar. 

The following concentration inequality will be used in the 
analysis and is introduced here for the reader’s ease. 

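Fact 1: Chernoff-Hoeffding inequality [26].

Let $X_1, \ldots, X_t$ be a sequence of real-valued random variables
such that, for each $i \in \{1, \ldots, t\}$, $0 \le X_i \le 1$
and $\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] = \mu$, where $\mathcal{F}_i = \sigma(X_1, \ldots, X_i)$. Let
$S_t = \sum_{i=1}^{t} X_i$. Then for all $a > 0$,

$$P\left(\frac{S_t}{t} \ge \mu + a\right) \le e^{-2a^2 t}, \qquad P\left(\frac{S_t}{t} \le \mu - a\right) \le e^{-2a^2 t}.$$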
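
The role of this inequality can be illustrated numerically: it tells how many samples of an arm are enough for its sample mean to be within $a$ of the true mean with a given confidence. The Python sketch below evaluates the one-sided tail bound $e^{-2a^2 t}$ and inverts it; the function names are hypothetical.

```python
import math

def hoeffding_tail(a, t):
    """One-sided bound on P(S_t / t >= mu + a) for t samples in [0, 1]."""
    return math.exp(-2.0 * a * a * t)

def samples_needed(a, failure_prob):
    """Smallest t for which the one-sided tail bound is at most failure_prob."""
    return math.ceil(math.log(1.0 / failure_prob) / (2.0 * a * a))
```

Since the bound decays exponentially in the number of samples, a number of exploration samples that grows only logarithmically in the horizon already makes the probability of misjudging an arm small.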

We now give the main result of this section. 

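Theorem 1. (Regret bounds for $E^3$ and $E^3$-TS policies)

Let $\Delta_{\min}$ and $\Delta_{\max}$ denote the differences between the
mean reward of the optimal arm and those of the second best and
worst arms, respectively.

(i) If $\Delta_{\min}$ is known, set $\gamma = \lceil 2/\Delta_{\min}^2 \rceil$ and $\gamma_\beta = \lceil 8/\Delta_{\min}^2 \rceil$.
Then, the expected regret of the $E^3$ and $E^3$-TS policies with
computation cost $C$ is

$$\tilde{\mathcal{R}}_{E^3}(T) \le N\Delta_{\max}\gamma\log(T) + NC\log(T) + 8N\Delta_{\max}, \tag{10}$$

$$\tilde{\mathcal{R}}_{E^3\text{-TS}}(T) \le N\Delta_{\max}\gamma_\beta\log(T) + NC\log(T) + 16N\Delta_{\max}. \tag{11}$$

(ii) If $\Delta_{\min}$ is not known, choose $\gamma = \gamma_t$, where $\{\gamma_t\}$ is a
positive sequence such that $\gamma_t \to \infty$ as $t \to \infty$. Then,

$$\tilde{\mathcal{R}}(T) \le N\Delta_{\max}\gamma_T\log(T) + NC\log(T) + N\Delta_{\max}B, \tag{12}$$

where $\tilde{\mathcal{R}}(T) = \max\{\tilde{\mathcal{R}}_{E^3}(T), \tilde{\mathcal{R}}_{E^3\text{-TS}}(T)\}$ and $B$ is a constant
independent of $T$. In particular, for $\gamma_t = \log^{\delta} t$, $\delta \in (0,1)$,

$$\tilde{\mathcal{R}}(T) \le N\Delta_{\max}\log^{1+\delta}(T) + NC\log(T) + N\Delta_{\max}B(\delta),$$

where $B(\delta) = 2^{l(\delta)}$, $l(\delta) = (\Delta_{\min}^2/4)^{-1/\delta}$.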


The proof is given in Appendix A.
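
To get a feel for the logarithmic growth in part (i), the Python sketch below evaluates the right-hand side of bound (10) and counts the slots a phased policy spends exploring when epoch $l$ lasts $N\gamma + 2^l$ slots. The function names are hypothetical, and the use of the natural logarithm is an assumption of this sketch.

```python
import math

def e3_regret_bound(n_arms, delta_max, gamma, cost, horizon):
    """Right-hand side of bound (10), with the natural logarithm assumed."""
    log_t = math.log(horizon)
    return (n_arms * delta_max * gamma * log_t
            + n_arms * cost * log_t
            + 8 * n_arms * delta_max)

def exploration_slots(n_arms, gamma, horizon):
    """Slots spent exploring up to `horizon` when epoch l lasts N*gamma + 2**l."""
    t, epoch, explored = 0, 1, 0
    while t < horizon:
        explored += min(n_arms * gamma, horizon - t)
        t += n_arms * gamma + 2 ** epoch
        epoch += 1
    return explored
```

With two arms and $\gamma = 8$, a thousandfold increase in the horizon roughly doubles the number of exploration slots, which is the behaviour that the $\log(T)$ terms in the bound capture.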

Remark 1. (i) For the sake of clarity, we will assume that
$\gamma_t$ changes at the beginning of every exploration phase. (ii)
Part (i) of the above theorem assumes knowledge of $\Delta_{\min}$
in order to define $\gamma$. In fact, we only need to know a lower
bound on $\Delta_{\min}$. If $\Delta_{lb} < \Delta_{\min}$, we can fix $\gamma = \lceil 2/\Delta_{lb}^2 \rceil$. It
is straightforward to show that, with a slight modification of
the proof, the theorem still holds. Obviously, a tighter lower
bound on $\Delta_{\min}$ results in a tighter bound on the regret.


Although the bounds of $E^3$ and $E^3$-TS are poorer than those of
UCB$_1$ and TS, they lend themselves to easy decentralization
and can be extended to multiplayer bandit problems with
minimal effort. Performances of all single player algorithms
are compared in Section V-A.


B. Multiplayer policies: $dE^3$ and $dE^3$-TS

In this section, we present multiplayer generalizations of
the $E^3$ and $E^3$-TS policies that were described in the previous
section. They are detailed in Algorithms 3 and 4, respectively.
They are also divided into exploration and exploitation phases.

Exploration phase: During exploration phases, players take
turns to explore arms in a round-robin fashion. At the end of an
exploration phase, the players update their index values (either
$g_{i,j}$ or $\theta_{i,j}$). Then they participate in a distributed bipartite
matching to determine the player-to-channel assignments.
This requires some additional time slots and thus comes at a
cost that contributes to regret. This communication and the
distributed bipartite matching process are compressed into line
5 in Algorithm 3 and line 8 in Algorithm 4 as a call to dBM.

Distributed bipartite matching (dBM): Let $g(t) := (g_{i,j}(t), 1 \le i \le M, 1 \le j \le N)$
denote a vector of indices. In both algorithms, dBM$_\epsilon(g(t))$ refers to
an $\epsilon$-optimal distributed bipartite matching algorithm,
such as Bertsekas’ auction algorithm [25], that yields a
matching $k^*(t) = (k_1^*(t), \ldots, k_M^*(t)) \in \mathcal{P}(N)$ such that

$$\sum_{i=1}^{M} g_{i,k_i^*(t)}(t) \ge \sum_{i=1}^{M} g_{i,k_i}(t) - \epsilon, \quad \forall k \in \mathcal{P}(N),\ k \ne k^*.$$

The details of the dBM implementation are described in Section IV-D.

Exploitation phase: In this phase, players stick to the 
allocation given to them at the end of the distributed bipartite 
matching process. No index computation is carried out in this 
phase. The length of the exploitation phase doubles in each 
successive epoch. 


Algorithm 3: $dE^3$

1: Initialization: Set $t = N\gamma$ and $l = 1$.
2: while ($t < T$) do
3:    Exploration Phase: Each player $i$, $1 \le i \le M$, plays each arm $j$, $1 \le j \le N$, $\gamma$ number of times;
4:    Update the index $g_{i,j}(l) = \bar{X}_{i,j}(l)$;
5:    Participate in the dBM$_\epsilon(g(l))$ algorithm to obtain a match $k^*(l)$;
6:    Exploitation Phase: Each player $i$ plays arm $k_i^*(l)$ for $2^l$ time slots;
7:    $t \leftarrow t + MN\gamma + 2^l$, $l \leftarrow l + 1$;
8: end while
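
The structure of Algorithm 3 can be sketched as a centralized Python simulation. This is an illustration under stated assumptions: the function names are hypothetical, rewards are Bernoulli, one (player, arm) pair is explored per slot so that exploration is collision-free, and the distributed matching dBM is replaced by an exact brute-force matching, so communication slots and the precision $\epsilon$ are not modelled.

```python
import random
from itertools import permutations

def best_matching(index):
    """Brute-force stand-in for dBM: the assignment with the largest index sum."""
    m, n = len(index), len(index[0])
    return max(
        permutations(range(n), m),
        key=lambda k: sum(index[i][k[i]] for i in range(m)),
    )

def de3(mu, horizon, gamma, seed=0):
    """Simulate dE^3 and return the matching used in the last epoch."""
    rng = random.Random(seed)
    m, n = len(mu), len(mu[0])
    totals = [[0.0] * n for _ in range(m)]
    counts = [[0] * n for _ in range(m)]
    t, epoch, matching = 0, 1, None
    while t < horizon:
        # Exploration phase: M*N*gamma slots, one (player, arm) pair per slot.
        for i in range(m):
            for j in range(n):
                for _ in range(gamma):
                    if t >= horizon:
                        break
                    totals[i][j] += 1.0 if rng.random() < mu[i][j] else 0.0
                    counts[i][j] += 1
                    t += 1
        # Indices are sample means; the matching is computed once per epoch.
        index = [
            [totals[i][j] / counts[i][j] if counts[i][j] else 0.0
             for j in range(n)]
            for i in range(m)
        ]
        matching = best_matching(index)
        # Exploitation phase: players hold the matching for 2**epoch slots
        # (rewards collected here are not used for learning in this sketch).
        t = min(horizon, t + 2 ** epoch)
        epoch += 1
    return matching
```

In the decentralized policy, each player would only hold its own row of the index matrix, and the call to `best_matching` would be replaced by the message exchange of the distributed matching algorithm.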


C. Regret analysis 

In both these algorithms, the total regret can be thought of
as the sum of three different regret terms. The time slots
spent in exploration are considered to contribute to regret
as the first term, $\mathcal{R}^O(T)$. At the end of every exploration
phase, a bipartite matching algorithm is run and each run
adds cost $C$ to the second term of regret, $\mathcal{R}^C(T)$. The cost $C$
depends on two parameters: (a) the precision of the bipartite
matching algorithm, $\epsilon_1 > 0$, and (b) the precision of the index
representation, $\epsilon_2 > 0$. A bipartite matching algorithm has an
$\epsilon_1$-precision if it gives an $\epsilon_1$-optimal matching. This would
happen, for example, when such an algorithm is run only
for a finite number of rounds. The index has an $\epsilon_2$-precision
if any two indices are not distinguishable if they are closer

Algorithm 4: $dE^3$-TS

1: Initialization: Set $t = N\gamma$ and $l = 1$. For each arm $j = 1, 2, \ldots, N$ and player $i = 1, \ldots, M$, set $S_{i,j} = 0$, $F_{i,j} = 0$;
2: while ($t < T$) do
3:    Exploration Phase: Each player $i$, $1 \le i \le M$, plays each arm $j$, $1 \le j \le N$, $\gamma$ number of times;
4:    For each play of each arm $j$, store reward as $\tilde{r}_{i,j}(t)$;
5:    Perform a Bernoulli trial with success probability $\tilde{r}_{i,j}(t)$ and observe output $r_{i,j}(t)$;
6:    If $r_{i,j}(t) = 1$, set $S_{i,j} = S_{i,j} + 1$, else $F_{i,j} = F_{i,j} + 1$;
7:    Sample $\theta_{i,j}$ from Beta($S_{i,j} + 1$, $F_{i,j} + 1$) distribution;
8:    Participate in the dBM$_\epsilon(\theta)$ algorithm to obtain a match $k^*(l)$;
9:    Exploitation Phase: Each player $i$ plays arm $k_i^*(l)$ for $2^l$ time slots;
10:   $t \leftarrow t + MN\gamma + 2^l$, $l \leftarrow l + 1$;
11: end while


than $\epsilon_2$. This can happen, for instance, when indices must be
communicated to other players with a finite number of bits.
Thus, the cost $C$ is a function of $\epsilon_1$ and $\epsilon_2$, and can be denoted
as $C(\epsilon_1, \epsilon_2)$, with $C(\epsilon_1, \epsilon_2) \to \infty$ as $\epsilon_1$ or $\epsilon_2 \to 0$. Since $\epsilon_1$
and $\epsilon_2$ are parameters that are fixed a priori, we consider
$\epsilon = \min(\epsilon_1, \epsilon_2)$ to specify both precisions. We shall denote
this computation and communication cost by $C(\epsilon)$, as different
communication methods and implementations of distributed
bipartite matching will give different costs.

The third term in the regret expression, $\mathcal{R}^I(T)$, comes from
non-optimal matchings in the exploitation phase, i.e., if the
matching $k^*(l)$ is not the optimal matching $k^{**}$. Thus, we
have the total expected regret of the $dE^3$ and $dE^3$-TS policies
to be given by

$$\tilde{\mathcal{R}}(T) = \mathcal{R}^O(T) + \mathcal{R}^I(T) + \mathcal{R}^C(T). \tag{13}$$

We now give the main results of this section. 

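Theorem 2. (i) Let $\epsilon > 0$ be the precision of the bipartite
matching algorithm and the precision of the index
representation. If $\Delta_{\min}$ is known, choose $\epsilon$ such that
$0 < \epsilon < \Delta_{\min}/(M+1)$, set $\gamma = \lceil 2M^2/(\Delta_{\min} - (M+1)\epsilon)^2 \rceil$
and $\gamma_\beta = \lceil 8M^2/(\Delta_{\min} - (M+1)\epsilon)^2 \rceil$. Then, the expected
regrets of the $dE^3$ and $dE^3$-TS policies are

$$\tilde{\mathcal{R}}_{dE^3}(T) \le MN\Delta_{\max}\gamma\log(T) + MN\,C(\epsilon)\log(T) + 8MN\Delta_{\max}, \tag{14}$$

$$\tilde{\mathcal{R}}_{dE^3\text{-TS}}(T) \le MN\Delta_{\max}\gamma_\beta\log(T) + MN\,C(\epsilon)\log(T) + 16MN\Delta_{\max}. \tag{15}$$

Note that, in the above expressions, $\epsilon$ is a chosen constant.
Thus, $\tilde{\mathcal{R}}(T) = O(\log(T))$ for both policies.

(ii) If $\Delta_{\min}$ is not known, choose $\gamma = \gamma_t$, where $\{\gamma_t\}$ is a
positive sequence such that $\gamma_t \to \infty$ as $t \to \infty$. Also choose
$\epsilon = \epsilon_t$, where $\{\epsilon_t\}$ is a positive sequence such that $\epsilon_t \to 0$ as
$t \to \infty$. Then,

$$\tilde{\mathcal{R}}_d(T) \le MN\Delta_{\max}\gamma_T\log(T) + MN\,C(\epsilon_T)\log(T) + MNB, \tag{16}$$

where $\tilde{\mathcal{R}}_d(T) = \max\{\tilde{\mathcal{R}}_{dE^3}(T), \tilde{\mathcal{R}}_{dE^3\text{-TS}}(T)\}$ and $B$ is a
constant independent of $T$. In particular, for $C(\epsilon) = \epsilon^{-1}$,
choose $\gamma_t = \log^{\delta} t$, $\epsilon_t = \log^{-\delta} t$, $\delta \in (0,1)$, and we get

$$\tilde{\mathcal{R}}_d(T) \le MN\Delta_{\max}\log^{1+\delta}(T) + MN\log^{1+\delta}(T) + MNB(\delta), \tag{17}$$

where $B(\delta) = b_0 2^{l(\delta)}$, $l(\delta) = (\Delta_{\min}^2/4)^{-1/\delta}$, and $b_0$ is a
constant independent of $\delta$.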

The proof is given in Appendix B.
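
The parameter choices in part (i) can be written as a small helper. The Python sketch below computes $\gamma$ and $\gamma_\beta$ from $M$, $\Delta_{\min}$ and $\epsilon$, and rejects a precision that violates $0 < \epsilon < \Delta_{\min}/(M+1)$. The function name is hypothetical.

```python
import math

def de3_exploration_lengths(num_players, delta_min, eps):
    """Return (gamma, gamma_beta) as prescribed in part (i) of Theorem 2."""
    if not 0.0 < eps < delta_min / (num_players + 1):
        raise ValueError("need 0 < eps < delta_min / (M + 1)")
    margin = delta_min - (num_players + 1) * eps
    gamma = math.ceil(2 * num_players ** 2 / margin ** 2)
    gamma_beta = math.ceil(8 * num_players ** 2 / margin ** 2)
    return gamma, gamma_beta
```

A coarser precision $\epsilon$ shrinks the margin $\Delta_{\min} - (M+1)\epsilon$ and therefore lengthens the exploration phases, while a finer precision raises the matching cost $C(\epsilon)$; the theorem fixes $\epsilon$ as a constant to balance the two.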


D. Distributed Bipartite Matching 

Both the dE$^3$ and dE$^3$-TS algorithms use the distributed
bipartite matching algorithm as a subroutine. In Section IV-B
we have given an abstract description of this distributed
bipartite matching algorithm. We now present one such algorithm,
namely, Bertsekas' auction algorithm [21], and its distributed
implementation. We note that the presented algorithm is not
the only one that can be used. Both dE$^3$ and dE$^3$-TS
will work with a distributed implementation of any
bipartite matching algorithm, e.g., the algorithms given in [22].

Consider a bipartite graph with $M$ players on one side
and $N$ arms on the other, with $M \le N$. Each player $i$ has a
value $\mu_{i,j}$ for each arm $j$. Each player knows only his own
values. Let us denote by $k^{**}$ a matching that maximizes the
matching surplus $\sum_{i,j} \mu_{i,j} x_{i,j}$, where the variable
$x_{i,j}$ is 1 if $i$ is matched with $j$, and 0 otherwise. Note that
$\sum_i x_{i,j} \le 1$ for all $j$, and $\sum_j x_{i,j} \le 1$ for all $i$.
Our goal is to find an $\varepsilon$-optimal matching. We call a
matching $k^{*}$ $\varepsilon$-optimal if
$\sum_i \mu_{i,k^{**}(i)} - \sum_i \mu_{i,k^{*}(i)} \le \varepsilon$.
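The definition of an $\varepsilon$-optimal matching can be checked by brute force on small instances. The sketch below is illustrative only: the helper names and the 2-player, 3-arm value matrix are hypothetical, and exhaustive search is used purely for clarity, not as the matching method of the paper.

```python
from itertools import permutations

def surplus(mu, matching):
    """Total value of a matching; matching[i] is the arm given to player i."""
    return sum(mu[i][j] for i, j in enumerate(matching))

def optimal_matching(mu):
    """Brute-force search over all one-to-one assignments of players to arms."""
    num_players, num_arms = len(mu), len(mu[0])
    return max(permutations(range(num_arms), num_players),
               key=lambda k: surplus(mu, k))

def is_eps_optimal(mu, matching, eps):
    """True if the matching is within eps of the optimal surplus."""
    best = surplus(mu, optimal_matching(mu))
    return best - surplus(mu, matching) <= eps

# Hypothetical instance: rows are players, columns are arms.
mu = [[0.5, 0.25, 0.0],
      [0.25, 0.75, 0.5]]
print(optimal_matching(mu))               # (0, 1)
print(is_eps_optimal(mu, (0, 2), 0.25))   # True
print(is_eps_optimal(mu, (0, 2), 0.125))  # False
```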


Algorithm 5: dBM$_\varepsilon$ (Bertsekas Auction Algorithm)

1: All players $i$ initialize prices $p_j = 0$, $\forall$ channels $j$;

2: while (prices change) do

3: Player $i$ communicates his preferred arm $j_i^{*}$ and bid
$b_i = \max_j(\mu_{ij} - p_j) - \text{second.max}_j(\mu_{ij} - p_j) + \varepsilon/M$
to all other players.

4: Each player determines on his own if he is the winner $i_j^{*}$ on
arm $j$;

5: All players set prices pj = 

6: end while 


Here, $\text{second.max}_j$ is the second highest maximum over all
$j$. The best arm for a player $i$ is arm $j_i^{*} = \arg\max_j(\mu_{ij} - p_j)$.
The winner $i_j^{*}$ on an arm $j$ is the one with the highest bid.
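The bidding rule above can be illustrated with a centralized, sequential simulation of an auction. This is a minimal sketch and not the distributed implementation: bidders act one at a time rather than simultaneously, and the price update used here (the winning bidder raises the arm's price by its bid) is the standard textbook auction update, which is an assumption of this sketch. The function name `auction_matching` is hypothetical.

```python
def auction_matching(mu, eps):
    """Sequential auction for M players and N >= M arms.

    mu[i][j] is the value of arm j to player i. Returns (assigned, prices),
    where assigned[i] is the arm held by player i at termination.
    """
    num_players, num_arms = len(mu), len(mu[0])
    assert num_players <= num_arms
    step = eps / num_players            # bidding increment eps / M
    prices = [0.0] * num_arms
    owner = [None] * num_arms           # arm -> player currently holding it
    assigned = [None] * num_players     # player -> arm currently held
    unassigned = list(range(num_players))
    while unassigned:
        i = unassigned.pop()
        net = [mu[i][j] - prices[j] for j in range(num_arms)]
        best = max(range(num_arms), key=net.__getitem__)
        second = max((net[j] for j in range(num_arms) if j != best),
                     default=net[best])
        bid = net[best] - second + step  # top value minus second value, plus increment
        prices[best] += bid              # assumed standard price update
        previous = owner[best]
        if previous is not None:         # outbid player goes back to the queue
            assigned[previous] = None
            unassigned.append(previous)
        owner[best] = i
        assigned[i] = best
    return assigned, prices

# Hypothetical instance in which both players initially prefer arm 0.
matching, prices = auction_matching([[1.0, 0.0], [1.0, 0.5]], eps=0.1)
print(matching)  # [0, 1]
```

In the example, player 1 first wins arm 0, is then outbid by player 0, and settles on arm 1, which is the surplus-maximizing outcome for this instance.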

The following lemma in [21] establishes that Bertsekas'
auction algorithm will find the $\varepsilon$-optimal matching in a finite
number of steps.

Lemma 1. [21] Given $\varepsilon > 0$, Algorithm 5 with rewards $\mu_{i,j}$,
for player $i$ playing the $j$th arm, converges to a matching
$k^{*}$ such that $\sum_i \mu_{i,k^{**}(i)} - \sum_i \mu_{i,k^{*}(i)} \le \varepsilon$,
where $k^{**}$ is an optimal matching. Furthermore, this convergence
occurs in less than $M^2/\varepsilon$ iterations.

Our only assumption here is that each user can
observe a channel and determine whether there was a successful
transmission on it, a collision, or no transmission, in a given
time slot. The distributed implementation consists of $J$ rounds.
In each round, users transmit in a round-robin fashion, where each
user signals her channel preference using $\lceil \log M \rceil$ bits
and her bid value (difference of the top two indices) using
$\lceil \log 1/\varepsilon_1 \rceil$ bits. The number of rounds $J$ is
chosen so that the dBM algorithm (based on Algorithm 5) returns an
$\varepsilon_2$-optimal matching. More details on this implementation
are given in [11].
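The signalling cost of this scheme follows directly from the bit counts stated above. The sketch below is a back-of-the-envelope calculation under stated assumptions: logarithms are taken base 2, every user transmits once per round, and the number of rounds $J$ is supplied by the caller. The function names are hypothetical.

```python
import math

def bits_per_turn(num_players, bid_precision):
    """Bits one user sends in its turn: arm preference plus quantised bid."""
    preference_bits = math.ceil(math.log2(num_players))
    bid_bits = math.ceil(math.log2(1.0 / bid_precision))
    return preference_bits + bid_bits

def bits_per_matching(num_players, bid_precision, num_rounds):
    """Total bits for J rounds in which every user transmits once per round."""
    return num_rounds * num_players * bits_per_turn(num_players, bid_precision)

# Hypothetical values: M = 3 users, bid precision 0.001, J = 5 rounds.
print(bits_per_turn(3, 0.001))          # 12
print(bits_per_matching(3, 0.001, 5))   # 180
```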

V. Simulations 

We conducted extensive simulations comparing the performance
of the proposed policies with prior work. The results
are presented in the respective sections below.

A. Single player bandit policies 

For the single player setting, we considered a four-armed
bandit problem with rewards for arms drawn independently
from Bernoulli distributions with means 0.1, 0.5, 0.6, 0.9.
The scenario was simulated over a fixed time horizon of
$T = 2{,}000{,}000$ timeslots and the performance of the proposed
single-player policies was evaluated. The performance of each
policy was averaged over 10 sample runs, and the averaged results
are presented here. Different true means and distributions were also
considered and they gave similar rankings for the algorithms.
In the interest of space, those scenarios are not presented.
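A simulation of this kind can be sketched in a few lines. The code below is a simplified illustration, not the authors' simulation code: it assumes the epoch structure used in the analysis of Appendix A (each exploration epoch plays every arm $\gamma$ times, the $l$-th exploitation epoch lasts $2^l$ slots and plays the arm with the highest exploration sample mean), ignores computation cost, and tracks pseudo-regret computed from the true means. The function name and the short horizon are hypothetical.

```python
import random

def simulate_e3(means, gamma, horizon, seed=0):
    """Run an E3-style policy on Bernoulli arms.

    Returns (pseudo_regret, exploration_pulls_per_arm).
    """
    rng = random.Random(seed)
    n_arms = len(means)
    best_mean = max(means)
    pulls = [0] * n_arms       # exploration samples only
    successes = [0] * n_arms
    regret = 0.0
    t = 0
    epoch = 1

    def play(arm):
        nonlocal t, regret
        reward = 1 if rng.random() < means[arm] else 0
        regret += best_mean - means[arm]
        t += 1
        return reward

    while t < horizon:
        # Exploration epoch: every arm is played gamma times.
        for arm in range(n_arms):
            for _ in range(gamma):
                if t >= horizon:
                    return regret, pulls
                successes[arm] += play(arm)
                pulls[arm] += 1
        # Exploitation epoch: commit to the empirical leader for 2^epoch slots.
        leader = max(range(n_arms), key=lambda a: successes[a] / pulls[a])
        for _ in range(2 ** epoch):
            if t >= horizon:
                return regret, pulls
            play(leader)
        epoch += 1
    return regret, pulls

# Means from the single-player experiment; gamma = ceil(2 / 0.1^2) = 200.
regret, pulls = simulate_e3([0.1, 0.5, 0.6, 0.9], gamma=200, horizon=5000)
print(pulls)  # [1274, 1200, 1200, 1200]
```

Because the exploration schedule is deterministic, the exploration counts do not depend on the random rewards; only the exploitation choices do.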

In Figure 1, the single player policies proposed in this paper,
E$^3$ and E$^3$-TS, are compared with the benchmark UCB$_1$ policy.
$\Delta_{\min}$ is assumed to be known (0.1) and, consequently, $\gamma$
is fixed. The bound for E$^3$-TS is also shown with the dashed
line. It can be observed that, although all three policies have
logarithmic order of regret performance in time, the new
E$^3$ and E$^3$-TS policies perform slightly worse than the UCB$_1$
policy. This is attributable to the deterministic exploration
phase length, which must take into account the worst-case
scenario. However, as we shall see in the next section, this
gives us a significant performance advantage in the multiplayer
setting.



Fig. 1: Figure showing growth of cumulative regret of the
E$^3$, E$^3$-TS and UCB$_1$ algorithms for a four-armed single-player
bandit problem with true means [0.1, 0.5, 0.6, 0.9] (no computation
cost), with time plotted on log scale.

Note that in Figure 1, computation cost is assumed to
be zero. If computation cost were included, E$^3$ and E$^3$-TS
would retain their logarithmic regret performance. However,
the cumulative regret of UCB$_1$ would grow linearly, just as
with TS [11].


B. Multiplayer bandit policies 

We now present the empirical performance of the proposed
dE$^3$ and dE$^3$-TS policies. We consider a three-player,
three-armed bandit setting. Rewards for each arm are generated
independently from a Bernoulli distribution with means
0.2, 0.25, 0.3 for player 1, 0.4, 0.6, 0.5 for player 2 and
0.7, 0.9, 0.8 for player 3. A time horizon spanning 20 epochs
was considered. $\varepsilon = 0.001$ was used as the tolerance for the
bipartite matching algorithm, which was done using dBM$_\varepsilon$, a
distributed implementation of Bertsekas' auction algorithm.
The performance of each policy was averaged over 10 sample
runs. $\gamma$ was set equal to 100 for dE$^3$ and 400 for dE$^3$-TS (see
the analysis for the reason for the differing values of $\gamma$). A fixed
per-unit cost, incurred each time the distributed bipartite matching
algorithm dBM is run, is included in the setting to model
communication cost in the decentralized setting.

The plot of the growth of cumulative regret with time of
dE$^3$, dE$^3$-TS and dUCB$_4$ is shown in Figure 2. We can see
that the logarithmic regret performance of dE$^3$ and dE$^3$-TS
clearly outperforms the $\log^2 T$ regret performance of our
earlier dUCB$_4$ policy [11]. The dashed curve is the theoretical
upper bound on the performance of dE$^3$-TS.


[Figure 2 plot: "Performance of different distributed learning policies". Curves: dE$^3$, dE$^3$-TS, dUCB$_4$ and the upper bound. Horizontal axis: timeslots (log scale).]

Fig. 2: Figure showing growth of cumulative regret of the dE$^3$
and dE$^3$-TS algorithms for a three-player, three-armed bandit
setting with true means [0.2 0.25 0.3; 0.4 0.6 0.5; 0.7 0.9 0.8]
(communication cost included), with time plotted on log scale.


VI. Conclusion 

We designed two closely related single player and multiplayer
bandit policies that achieve logarithmic or near-logarithmic
regret performance depending on the assumptions
of the model. Both policies have deterministic exploration
and exploitation phases, which make them well-suited to
decentralization for use in the multiplayer setting.

The performance of these policies was compared to prior work
in the literature. They were shown to outperform previous
policies for multiplayer bandits, but not for the single player
model, due to the deterministic phases of these new policies.
While we have approached logarithmic regret performance under
certain assumptions in the multiplayer model, the question
of whether a policy under truly general conditions can achieve
fully logarithmic regret remains open.

Appendix A
Proof of Theorem 1 

In the following proof, the subscript $\beta$ for $\gamma_\beta$ will be omitted
where the context refers to both E$^3$ and E$^3$-TS together. Also,
$\mathcal{R}$ will be used to denote regret for either policy when both
are within context.

A. $\Delta_{\min}$ is known:

We denote the expected regrets incurred in the exploration
phases with $\mathcal{R}^{O}(T)$, in the exploitation phases with
$\mathcal{R}^{I}(T)$, and due to computation with $\mathcal{R}^{C}(T)$. Then,

$$\mathcal{R}(T) = \mathcal{R}^{O}(T) + \mathcal{R}^{C}(T) + \mathcal{R}^{I}(T), \tag{18}$$

for both E$^3$ and E$^3$-TS policies.

Let $T$ be in the $l_0$-th exploitation epoch. By construction,
$T \ge N\gamma l_0 + 2^{l_0} - 2$. Thus, $\log T \ge l_0$ and,

$$\mathcal{R}^{O}(T) = \gamma l_0 \sum_{j=2}^{N} \Delta_j \le N\gamma l_0 \Delta_{\max} \le N\gamma \log(T)\,\Delta_{\max}, \tag{19}$$

where $\Delta_j = \mu_1 - \mu_j$. Also, using the definition of computation
cost,

$$\mathcal{R}^{C}(T) = N C l_0 \le N C \log(T). \tag{20}$$

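Now, $\mathcal{R}^{I}(T) = \mathbb{E}\big[\sum_{j=2}^{N} \Delta_j \tilde{n}_j(T)\big]$, where
$\tilde{n}_j(T)$ is the number of times arm $j$ has been played during the
exploitation phases. For E$^3$,

$$\tilde{n}_j(T) = \sum_{l=1}^{l_0} 2^{l}\, I\{\bar{X}_j(l) > \max_{i \ne j} \bar{X}_i(l)\} \le \sum_{l=1}^{l_0} 2^{l}\, I\{\bar{X}_j(l) > \bar{X}_1(l)\}. \tag{21}$$

Similarly, for E$^3$-TS, $\tilde{n}_j(T) \le \sum_{l=1}^{l_0} 2^{l}\, I\{\theta_j(l) > \theta_1(l)\}$. Thus,

$$\mathcal{R}^{I}_{E^3}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^{l}\, P(\bar{X}_1(l) < \bar{X}_j(l)), \tag{22}$$

$$\text{and}\quad \mathcal{R}^{I}_{E^3\text{-}TS}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^{l}\, P(\theta_1(l) < \theta_j(l)). \tag{23}$$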

The following two lemmas bound the event probabilities 
above for the E 3 and E 3 -TS policies. 

Lemma 2. For E$^3$, with $\gamma = \lceil 2/\Delta_{\min}^2 \rceil$,

$$P(\bar{X}_1(l) < \bar{X}_j(l)) \le 2e^{-l}, \quad \forall j > 1. \tag{24}$$

Proof. The event $\{\bar{X}_1(l) < \bar{X}_j(l)\}$ implies at least one of the
following events:

$$A_j := \{\bar{X}_j(l) - \mu_j > \Delta_j/2\}, \qquad B_j := \{\bar{X}_1(l) - \mu_1 < -\Delta_j/2\}.$$

Using the Chernoff-Hoeffding bound and choosing
$\gamma = \lceil 2/\Delta_{\min}^2 \rceil$, we get

$$P(A_j) \le e^{-2 l \gamma \Delta_j^2/4} \le e^{-l}, \quad\text{and similarly,}\quad P(B_j) \le e^{-l}. \tag{25}$$

By the union bound, we get $P(\bar{X}_1(l) < \bar{X}_j(l)) \le 2e^{-l}$.

□
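The Hoeffding step in this proof can be spelled out as follows. After $l$ exploration epochs each arm has been sampled $l\gamma$ times; the derivation assumes rewards in $[0,1]$ and $\Delta_{\min} = \min_{j>1} \Delta_j$, both of which are assumptions made here for the illustration.

```latex
\begin{align*}
P(A_j) &= P\!\left(\bar{X}_j(l) - \mu_j > \tfrac{\Delta_j}{2}\right)
  \le \exp\!\left(-2\, l\gamma \left(\tfrac{\Delta_j}{2}\right)^{2}\right)
  = \exp\!\left(-\tfrac{l\gamma\Delta_j^{2}}{2}\right), \\
\gamma \ge \frac{2}{\Delta_{\min}^{2}} \ge \frac{2}{\Delta_j^{2}}
  &\;\Longrightarrow\; \tfrac{l\gamma\Delta_j^{2}}{2} \ge l
  \;\Longrightarrow\; P(A_j) \le e^{-l}.
\end{align*}
```

The same computation applies to the event on arm 1, and the two bounds add up to $2e^{-l}$.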

Lemma 3. For E$^3$-TS, with $\gamma_\beta = \lceil 8/\Delta_{\min}^2 \rceil$,

$$P(\theta_1(l) < \theta_j(l)) \le 4e^{-l}, \quad \forall j > 1. \tag{26}$$

Proof. Without loss of generality, we will assume the underlying
reward distributions of the arms to be Bernoulli
distributions to simplify the analysis. This eliminates the need
for line 5 in the E$^3$-TS policy illustrated in Algorithm 2.
However, this assumption can be relaxed without any change
to the results.

As in Lemma 2, the event $\{\theta_1(l) < \theta_j(l)\}$ implies at least
one of the events:

$$A_j := \{\theta_j(l) - \mu_j > \Delta_j/2\}, \qquad B_j := \{\theta_1(l) - \mu_1 < -\Delta_j/2\}. \tag{27}$$

Let $m_j(l)$ denote the number of plays of arm $j$ during
the exploration phases after the $l$-th exploration epoch, and
let $s_j(l)$ be the number of successes ($r = 1$) in these plays.
Then, $\theta_j(l)$ is sampled from a $\beta(s_j(l) + 1,\; m_j(l) - s_j(l) + 1)$
distribution.
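The sampling step described here, in which each arm's index is drawn from a Beta distribution with parameters (successes + 1, failures + 1), can be sketched with the standard library. The function names and the example counts are hypothetical; only the Beta parameters come from the text.

```python
import random

def thompson_indices(successes, plays, rng):
    """Draw one Beta(s + 1, m - s + 1) sample per arm (Bernoulli rewards)."""
    return [rng.betavariate(s + 1, m - s + 1)
            for s, m in zip(successes, plays)]

def exploit_arm(successes, plays, rng):
    """Arm with the largest sampled index."""
    theta = thompson_indices(successes, plays, rng)
    return max(range(len(theta)), key=theta.__getitem__)

rng = random.Random(0)  # fixed seed only for reproducibility
theta = thompson_indices(successes=[12, 95], plays=[100, 100], rng=rng)
arm = exploit_arm(successes=[12, 95], plays=[100, 100], rng=rng)
```

With many exploration samples the Beta draws concentrate around the empirical success rates, which is what the two-term bound in the proof quantifies.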

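Additionally, let $A(l)$ denote the event
$\{ s_j(l)/m_j(l) < \mu_j + \Delta_j/4 \}$. Then,

$$P\!\left(\theta_j(l) > \mu_j + \tfrac{\Delta_j}{2}\right) \le P\big(\overline{A(l)}\big) + P\!\left(\theta_j(l) > \mu_j + \tfrac{\Delta_j}{2},\; A(l)\right). \tag{28}$$

The first term in the expression satisfies

$$P\big(\overline{A(l)}\big) = P\!\left(\frac{s_j(l)}{m_j(l)} \ge \mu_j + \frac{\Delta_j}{4}\right) \le \exp\!\left(-\frac{2\gamma_\beta\, l\, \Delta_j^2}{16}\right),$$

where the last inequality comes from the Chernoff-Hoeffding
inequality and by noting that $s_j(l)/m_j(l)$ is a random variable with
mean $\mu_j$. Also, $m_j(l) = \gamma_\beta l$.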

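The second term satisfies

$$\begin{aligned}
P\!\left(\theta_j(l) > \mu_j + \tfrac{\Delta_j}{2},\; A(l)\right)
&\le P\!\left(\beta\big(s_j(l)+1,\; m_j(l)-s_j(l)+1\big) > \frac{s_j(l)}{m_j(l)} + \frac{\Delta_j}{4}\right) \\
&= \mathbb{E}\!\left[F^{B}_{m_j(l)+1,\; \frac{s_j(l)}{m_j(l)}+\frac{\Delta_j}{4}}\big(s_j(l)\big)\right] \\
&\le \mathbb{E}\!\left[F^{B}_{m_j(l),\; \frac{s_j(l)}{m_j(l)}+\frac{\Delta_j}{4}}\big(s_j(l)\big)\right].
\end{aligned} \tag{29}$$

Here, $F^{B}_{n,p}(x)$ is the cdf of the binomial$(n, p)$ distribution.
The equality in the second-to-last line comes from the fact
that $F^{\beta}_{a,b}(x) = 1 - F^{B}_{a+b-1,x}(a-1)$, where $F^{\beta}_{a,b}(x)$ is the cdf
of the $\beta(a, b)$ distribution [8]. The inequality on the last line
is a standard inequality for binomial distributions.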
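The identity between the Beta cdf and the binomial cdf used above can be checked numerically for integer parameters. The sketch below uses only the standard library; the Beta cdf is approximated by midpoint-rule integration, and the function names and test values are hypothetical choices for illustration.

```python
import math

def binom_cdf(n, p, k):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1))

def beta_cdf(a, b, x, steps=20000):
    """Beta(a, b) cdf at x for integer a, b, by midpoint-rule integration."""
    norm = math.factorial(a + b - 1) / (math.factorial(a - 1) * math.factorial(b - 1))
    h = x / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h
        total += u**(a - 1) * (1 - u)**(b - 1)
    return norm * total * h

a, b, x = 4, 8, 0.3
lhs = beta_cdf(a, b, x)                      # F^beta_{a,b}(x)
rhs = 1 - binom_cdf(a + b - 1, x, a - 1)     # 1 - F^B_{a+b-1,x}(a-1)
print(abs(lhs - rhs) < 1e-6)                 # True
```

With $a = s_j(l) + 1$ and $b = m_j(l) - s_j(l) + 1$, the binomial on the right has $a + b - 1 = m_j(l) + 1$ trials and is evaluated at $a - 1 = s_j(l)$, which is the form used in the proof.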

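But, by the Chernoff-Hoeffding inequality, it can be seen
that $F^{B}_{n,p}(np - n\delta) \le \exp(-2n\delta^2)$. Thus,

$$P\!\left(\theta_j(l) > \mu_j + \tfrac{\Delta_j}{2},\; A(l)\right) \le \exp\!\left(-\frac{2\gamma_\beta\, l\, \Delta_j^2}{16}\right). \tag{30}$$

Setting $\gamma_\beta := \lceil 8/\Delta_{\min}^2 \rceil$ in (29) and (30), we get

$$P\!\left(\theta_j(l) > \mu_j + \tfrac{\Delta_j}{2}\right) \le 2e^{-l}. \tag{31}$$

Similarly, $P(\theta_1(l) < \mu_1 - \Delta_j/2) \le 2e^{-l}$, and the claim of the
lemma follows from the union bound.

□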


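Continuing with the proof of Theorem 1, applying Lemmas 2 and 3
to the exploitation regret bounds (22) and (23) gives

$$\mathcal{R}^{I}_{E^3}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^{l}\, 2e^{-l} \le 2N\Delta_{\max} \sum_{l=0}^{\infty} (2/e)^{l} \le \frac{2N\Delta_{\max}}{1 - (2/e)} \le 8N\Delta_{\max}, \tag{32}$$

$$\mathcal{R}^{I}_{E^3\text{-}TS}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^{l}\, 4e^{-l} \le 4N\Delta_{\max} \sum_{l=0}^{\infty} (2/e)^{l} \le \frac{4N\Delta_{\max}}{1 - (2/e)} \le 16N\Delta_{\max}. \tag{33}$$

Now, combining all the terms, we get

$$\mathcal{R}_{E^3}(T) \le N\gamma \log(T)\,\Delta_{\max} + NC\log(T) + 8N\Delta_{\max}, \tag{34}$$

$$\mathcal{R}_{E^3\text{-}TS}(T) \le N\gamma_\beta \log(T)\,\Delta_{\max} + NC\log(T) + 16N\Delta_{\max}. \tag{35}$$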
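The constants 8 and 16 in the exploitation terms come from a geometric series with ratio $2/e$. The arithmetic is:

```latex
\[
\sum_{l=0}^{\infty}\left(\frac{2}{e}\right)^{l}
  = \frac{1}{1 - 2/e} \approx 3.78 < 4,
\qquad\text{so}\qquad
\frac{2N\Delta_{\max}}{1-2/e} < 8N\Delta_{\max},
\quad
\frac{4N\Delta_{\max}}{1-2/e} < 16N\Delta_{\max}.
\]
```

The series converges because exploitation epochs double in length while the error probability decays like $e^{-l}$, and $2 < e$.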


B. $\Delta_{\min}$ is unknown:

Let $t_l$ be the time $t$ at which the $l$-th exploration phase
begins. For clarity of explanation, we assume that $\gamma$
changes only at the beginning of an exploration phase. So,
in the $l$-th exploration phase, each arm is played $\gamma_{t_l}$ times in
a round-robin manner.

As in the proof given in the previous subsection, let $T$
be in the $l_0$-th exploitation epoch. By construction,
$T \ge N \sum_{l=1}^{l_0} \gamma_{t_l} + 2^{l_0} - 2$. Thus, $\log T \ge l_0$ and,

$$\mathcal{R}^{O}(T) = \sum_{l=1}^{l_0} \gamma_{t_l} \sum_{j=2}^{N} \Delta_j \le N\Delta_{\max}\gamma_T\, l_0 \le N\Delta_{\max}\gamma_T \log(T). \tag{36}$$

The first inequality uses the fact that $\{\gamma_t\}$ is a monotone
increasing sequence, and the second uses $l_0 \le \log T$.

The computation cost is the same as before, i.e.,

$$\mathcal{R}^{C}(T) = N C l_0 \le N C \log(T).$$


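Using (25),

$$\mathcal{R}^{I}_{E^3}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} \sum_{j=2}^{N} 2^{l}\, P(\bar{X}_1(l) < \bar{X}_j(l)) \le 2N\Delta_{\max} \sum_{l=1}^{\infty} 2^{l} e^{-b_1 \sum_{k=1}^{l} \gamma_{t_k}},$$

where $b_1 = \Delta_{\min}^2/2$. Since $\gamma_t \to \infty$ monotonically (and
$\gamma_{t_k} > 1$), there exists an $l'$ such that
$b_1 \sum_{k=1}^{l} \gamma_{t_k} > l$, $\forall l \ge l'$. Then,

$$\mathcal{R}^{I}_{E^3}(T) \le 2N\Delta_{\max} \left( \sum_{l=1}^{l'-1} 2^{l} e^{-b_1 \sum_{k=1}^{l} \gamma_{t_k}} + \sum_{l=l'}^{\infty} (2/e)^{l} \right) \le N\Delta_{\max} B,$$

where $B$ is a finite constant, independent of $T$.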

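When $\gamma_{t_l} = \log^{\delta} t_l$ for $\delta \in (0,1)$, it is easy to see that
$\gamma_{t_l} \ge l^{\delta}$, $\forall l$. Then,
$\sum_{k=1}^{l} \gamma_{t_k} \ge \sum_{k=1}^{l} k^{\delta} \ge \int_{1}^{l+1} (x-1)^{\delta}\,dx \ge 0.5\, l^{1+\delta}$.
From this, $l' = (2/b_1)^{1/\delta}$. Then, we can get
$B = B(\delta) = 2^{l'}$.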
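The threshold epoch $l'$ follows from a short calculation. Writing $b_1 = \Delta_{\min}^2/2$ for the Hoeffding exponent constant and using the lower bound $\sum_{k=1}^{l} \gamma_{t_k} \ge 0.5\, l^{1+\delta}$, the requirement that the exponent exceed $l$ becomes:

```latex
\begin{align*}
\tfrac{1}{2}\, b_1\, l^{1+\delta} \ge l
  &\iff l^{\delta} \ge \frac{2}{b_1}
  \iff l \ge \left(\frac{2}{b_1}\right)^{1/\delta} =: l', \\
b_1 = \frac{\Delta_{\min}^{2}}{2}
  &\;\Longrightarrow\; l' = \left(\frac{4}{\Delta_{\min}^{2}}\right)^{1/\delta}
  = \left(\frac{\Delta_{\min}^{2}}{4}\right)^{-1/\delta}.
\end{align*}
```

This agrees with the exponent $l(\delta)$ appearing in part (ii) of Theorem 2.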

Appendix B 
Proof of Theorem 2 

We first show that if $\Delta_{\min}$ is known, we can choose an
$\varepsilon < \Delta_{\min}/(M+1)$, such that the dE$^3$ and dE$^3$-TS algorithms
achieve a logarithmic regret growth with $T$. If $\Delta_{\min}$ is not
known, we can pick a positive monotone sequence $\{\varepsilon_t\}$ such
that $\varepsilon_t \to 0$ as $t \to \infty$. In a decentralized bipartite matching
algorithm, the precision $\varepsilon$ will depend on the amount of
information exchanged.

The proof will be illustrated here only for the dE$^3$ policy,
since the differences between it and the analysis of the dE$^3$-TS
policy are similar to those found in Theorem 1.

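Let us denote the optimal bipartite matching by $k^{**} \in \mathcal{P}(N)$,
such that $k^{**} \in \arg\max_{k \in \mathcal{P}(N)} \sum_{i=1}^{M} \mu_{i,k_i}$. Denote
$\mu^{**} := \sum_{i=1}^{M} \mu_{i,k^{**}_i}$, and define
$\Delta_k := \mu^{**} - \sum_{i=1}^{M} \mu_{i,k_i}$, $k \in \mathcal{P}(N)$.

Let $\Delta_{\min} = \min_{k \in \mathcal{P}(N),\, k \ne k^{**}} \Delta_k$ and
$\Delta_{\max} = \max_{k \in \mathcal{P}(N)} \Delta_k$. We assume $\Delta_{\min} > 0$.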
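For small instances, the gaps defined here can be computed by enumerating all matchings. The sketch below is illustrative only; the function name and the 2-player, 3-arm mean matrix are hypothetical, and at least two distinct matchings are assumed to exist.

```python
from itertools import permutations

def matching_gaps(mu):
    """Return (delta_min, delta_max) over all matchings of players to arms.

    mu[i][j] is the mean reward of arm j for player i. One optimal matching
    k** is excluded when taking the minimum, so ties at the top give 0.
    """
    num_players, num_arms = len(mu), len(mu[0])
    values = sorted(
        (sum(mu[i][j] for i, j in enumerate(k))
         for k in permutations(range(num_arms), num_players)),
        reverse=True,
    )
    best = values[0]
    gaps = [best - v for v in values[1:]]  # every matching except one optimal k**
    return min(gaps), max(gaps)

# Hypothetical instance: rows are players, columns are arms.
mu_example = [[0.5, 0.25, 0.0],
              [0.25, 0.75, 0.5]]
print(matching_gaps(mu_example))  # (0.25, 1.0)
```

A returned minimum of 0 signals that the optimal matching is not unique, in which case the assumption of a strictly positive minimum gap does not hold for that instance.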


A. $\Delta_{\min}$ is known:

Let $T$ be in the $l_0$-th exploitation epoch. It follows that
$T \ge MN\gamma l_0 + 2^{l_0} - 2$ and, hence, $\log T \ge l_0$. Then,

$$\mathcal{R}^{O}_{dE^3}(T) = MN\gamma l_0 \Delta_{\max} \le MN\gamma \log(T)\,\Delta_{\max}. \tag{37}$$

Also, by definition,

$$\mathcal{R}^{C}_{dE^3}(T) = MN\,C(\varepsilon)\, l_0 \le MN\,C(\varepsilon) \log(T). \tag{38}$$

A suboptimal matching occurs in the $l$-th exploitation epoch
if the event
$\{\sum_{i=1}^{M} \bar{X}_{i,k^{**}_i}(l) < (M+1)\varepsilon + \sum_{i=1}^{M} \bar{X}_{i,k^{*}_i}(l)\}$
occurs. If each index has an error of at most $\varepsilon$, the sum of $M$
terms may introduce an error of at most $M\varepsilon$. In addition, the
distributed bipartite matching algorithm dBM$_\varepsilon$ itself yields only
an $\varepsilon$-optimal matching. This accounts for the term $(M+1)\varepsilon$
above.

Clearly, as in the single player case,

$$\mathcal{R}^{I}_{dE^3}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} 2^{l}\, P\!\left(\sum_{i=1}^{M} \bar{X}_{i,k^{**}_i}(l) < (M+1)\varepsilon + \sum_{i=1}^{M} \bar{X}_{i,k^{*}_i}(l)\right). \tag{39}$$

The event
$\{\sum_{i=1}^{M} \bar{X}_{i,k^{**}_i}(l) < (M+1)\varepsilon + \sum_{i=1}^{M} \bar{X}_{i,k^{*}_i}(l)\}$
implies at least one of the following events

$$A_{i,j} := \{|\bar{X}_{i,j}(l) - \mu_{i,j}| > (\Delta_{\min} - (M+1)\varepsilon)/2M\}, \tag{40}$$

for $1 \le i \le M$, $1 \le j \le N$. By the Chernoff-Hoeffding
bound, and then using the fact that
$\gamma = \lceil 2M^2/(\Delta_{\min} - (M+1)\varepsilon)^2 \rceil$,

$$P(A_{i,j}) \le 2e^{-2 l \gamma (\Delta_{\min} - (M+1)\varepsilon)^2/4M^2} \le 2e^{-l}. \tag{41}$$

Then, by using the union bound,

$$\mathcal{R}^{I}_{dE^3}(T) \le \Delta_{\max} \sum_{l=1}^{l_0} 2^{l} \sum_{i=1}^{M} \sum_{j=1}^{N} P(A_{i,j}) \le \Delta_{\max}\, 2MN \sum_{l=0}^{\infty} (2/e)^{l} = \frac{2\Delta_{\max} MN}{1 - (2/e)} \le 8MN\Delta_{\max}. \tag{42}$$

Combining all the terms, we get

$$\mathcal{R}_{dE^3}(T) \le MN\gamma \log(T)\,\Delta_{\max} + MN\,C(\varepsilon)\log(T) + 8MN\Delta_{\max}. \tag{43}$$

In a similar manner,

$$\mathcal{R}_{dE^3\text{-}TS}(T) \le MN\gamma_\beta \log(T)\,\Delta_{\max} + MN\,C(\varepsilon)\log(T) + 16MN\Delta_{\max}, \tag{44}$$

where $\gamma_\beta = \lceil 8M^2/(\Delta_{\min} - (M+1)\varepsilon)^2 \rceil$.

B. $\Delta_{\min}$ is unknown:

The proof is similar to the proof of the analogous case of
Theorem 1 and is omitted.